16. Implement Naive Bayes C++

Implementing Naive Bayes

In this exercise you will implement a Gaussian Naive Bayes classifier to predict the behavior of vehicles on a highway. In the image below you can see the behaviors you'll be looking for on a 3 lane highway (with lanes of 4 meter width). The dots represent the d (y axis) and s (x axis) coordinates of vehicles as they either…

  1. change lanes left (shown in blue)
  2. keep lane (shown in black)
  3. or change lanes right (shown in red)

Your job is to write a classifier that can predict which of these three maneuvers a vehicle is engaged in given a single coordinate (sampled from the trajectories shown below).

Each coordinate contains 4 features:

  • s
  • d
  • \dot{s}
  • \dot{d}

You also know the lane width is 4 meters (this might be helpful in engineering additional features for your algorithm).

Instructions

  1. Implement the train(data, labels) method in the class GNB in classifier.cpp .

    Training a Gaussian Naive Bayes classifier consists of computing and storing the mean and standard deviation from the data for each label/feature pair. For example, given the label "change lanes left” and the feature \dot{s} , it would be necessary to compute and store the mean and standard deviation of \dot{s} over all data points with the "change lanes left” label.

    Additionally, it will be convenient in this step to compute and store the prior probability p(C_k) for each label C_k . This can be done by keeping track of the number of times each label appears in the training data.

  2. Implement the predict(observation) method in classifier.cpp .

    Given a new data point, prediction requires two steps:

    1. Compute the conditional probabilities for each feature/label combination . For a feature x and label C with mean \mu and standard deviation \sigma (computed in training), the conditional probability can be computed using the formula here :
    p(x = v | C) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp^{-\frac{(v-\mu)^2}{2\sigma^2}}

    Here v is the value of feature x in the new data point.

    1. Use the conditional probabilities in a Naive Bayes classifier. This can be done using the formula here :
    y = \underset{k\in (1,\ldots, K)}{argmax } \,\,p(C_k) \prod^n_{i=1}p(x_i = v_i| C_k)

    In this formula, the argmax is taken over all possible labels C_k and the product is taken over all features x_i with values v_i .

  3. When you want to test your classifier, run Test Run and check out the results.

NOTE : You are welcome to use some existing implementation of a Gaussian Naive Bayes classifier. But to get the best results you will still need to put some thought into what features you provide the algorithm when classifying. Though you will only be given the 4 coordinates listed above, you may find that by "engineering" features you may get better performance. For example: the raw value of the d coordinate may not be that useful. But d % lane_width might be helpful since it gives the relative position of a vehicle in it's lane regardless of which lane the vehicle is in.

Helpful Resources

Extra Practice

Provided in one of the links below is python_extra_practice , which is the same problem but written in Python that you can optionally go through for extra coding practice. The Python solution is available at the python_solution link. If you get stuck on the quiz see if you can convert the python solution to C++ and pass the classroom quiz with it. The last link Nd013_Pred_Data has all the training and testing data for this problem in case you want to run the problem offline.

Start Quiz:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "classifier.h"

using std::cout;
using std::endl;
using std::ifstream;
using std::string;
using std::vector;

// Helper functions to load .txt files
vector<vector<double> > Load_State(string file_name);
vector<string> Load_Label(string file_name);

int main() {
  vector< vector<double> > X_train = Load_State("./train_states.txt");
  vector< vector<double> > X_test  = Load_State("./test_states.txt");
  vector< string > Y_train = Load_Label("./train_labels.txt");
  vector< string > Y_test  = Load_Label("./test_labels.txt");
    
  cout << "X_train number of elements " << X_train.size() << endl;
  cout << "X_train element size " << X_train[0].size() << endl;
  cout << "Y_train number of elements " << Y_train.size() << endl;

  GNB gnb = GNB();
  
  gnb.train(X_train, Y_train);

  cout << "X_test number of elements " << X_test.size() << endl;
  cout << "X_test element size " << X_test[0].size() << endl;
  cout << "Y_test number of elements " << Y_test.size() << endl;
  
  int score = 0;
  for (int i = 0; i < X_test.size(); ++i) {
    vector<double> coords = X_test[i];
    string predicted = gnb.predict(coords);
    if (predicted.compare(Y_test[i]) == 0) {
      score += 1;
    }
  }

  float fraction_correct = float(score) / Y_test.size();
  cout << "You got " << (100*fraction_correct) << " correct" << endl;

  return 0;
}

// Load state from .txt file
vector<vector<double> > Load_State(string file_name) {
  ifstream in_state_(file_name.c_str(), ifstream::in);
  vector< vector<double >> state_out;
  string line;
    
  while (getline(in_state_, line)) {
    std::istringstream iss(line);
    vector<double> x_coord;
      
    string token;
    while (getline(iss,token,',')) {
      x_coord.push_back(stod(token));
    }
    state_out.push_back(x_coord);
  }

  return state_out;
}

// Load labels from .txt file
vector<string> Load_Label(string file_name) {
  ifstream in_label_(file_name.c_str(), ifstream::in);
  vector< string > label_out;
  string line;
  while (getline(in_label_, line)) {
    std::istringstream iss(line);
    string label;
    iss >> label;
    
    label_out.push_back(label);
  }
    
  return label_out; 
}
#include "classifier.h"
#include <math.h>
#include <string>
#include <vector>

using Eigen::ArrayXd;
using std::string;
using std::vector;

// Initializes GNB
GNB::GNB() {
  /**
   * TODO: Initialize GNB, if necessary. May depend on your implementation.
   */
  
}

GNB::~GNB() {}

void GNB::train(const vector<vector<double>> &data, 
                const vector<string> &labels) {
  /**
   * Trains the classifier with N data points and labels.
   * @param data - array of N observations
   *   - Each observation is a tuple with 4 values: s, d, s_dot and d_dot.
   *   - Example : [[3.5, 0.1, 5.9, -0.02],
   *                [8.0, -0.3, 3.0, 2.2],
   *                 ...
   *                ]
   * @param labels - array of N labels
   *   - Each label is one of "left", "keep", or "right".
   *
   * TODO: Implement the training function for your classifier.
   */
  
}

string GNB::predict(const vector<double> &sample) {
  /**
   * Once trained, this method is called and expected to return 
   *   a predicted behavior for the given observation.
   * @param observation - a 4 tuple with s, d, s_dot, d_dot.
   *   - Example: [3.5, 0.1, 8.5, -0.2]
   * @output A label representing the best guess of the classifier. Can
   *   be one of "left", "keep" or "right".
   *
   * TODO: Complete this function to return your classifier's prediction
   */
  
  return this -> possible_labels[1];
}
#ifndef CLASSIFIER_H
#define CLASSIFIER_H

#include <string>
#include <vector>
#include "Dense"

using Eigen::ArrayXd;
using std::string;
using std::vector;

class GNB {
 public:
  /**
   * Constructor
   */
  GNB();

  /**
   * Destructor
   */
  virtual ~GNB();

  /**
   * Train classifier
   */
  void train(const vector<vector<double>> &data, 
             const vector<string> &labels);

  /**
   * Predict with trained classifier
   */
  string predict(const vector<double> &sample);

  vector<string> possible_labels = {"left","keep","right"};
};

#endif  // CLASSIFIER_H

INSTRUCTOR NOTE:

Students have reported Load_State function in main.cpp is not loading data properly from train_states.txt and test_states.txt since istringstream is not implemented for comma (',') separated data. Several students have found working solutions here , that may help in determining your own solution.